Load packages:
library(tidyverse)
library(ggplot2) # superfluous because ggplot2 is part of tidyverse
library(haven)
library(labelled)
Resources used to create this lecture:
We will use two datasets that are part of the ggplot2 package:
mpg: EPA fuel economy data in 1999 and 2008 for 38 car models that had a new release every year between 1999 and 2008
diamonds: Prices and attributes of about 54,000 diamonds
#?mpg
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "aud…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattr…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8…
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", …
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, …
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, …
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact…
#?diamonds
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.2…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 6…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.0…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.3…
We will use public-use data from the National Center for Education Statistics (NCES) Education Longitudinal Study of 2002 (ELS):
stu_id uniquely identifies observations
# variables we want to select from full ELS dataset
els_keepvars <- c(
"STU_ID", # student id
"STRAT_ID", # stratum id
"PSU", # primary sampling unit
"BYRACE", # (base year) race/ethnicity
"BYINCOME", # (base year) parental income
"BYPARED", # (base year) parental education
"BYNELS2M", # (base year) math score
"BYNELS2R", # (base year) reading score
"F3ATTAINMENT", # (3rd follow up) attainment
"F2PS1SEC", # (2nd follow up) first institution attended
"F3ERN2011", # (3rd follow up) earnings from employment in 2011
"F1SEX", # (1st follow up) sex composite
"F2EVRATT", # (2nd follow up, composite) ever attended college
"F2PS1LVL", # (2nd follow up, composite) first attended postsecondary institution, level
"F2PS1CTR", # (2nd follow up, composite) first attended postsecondary institution, control
"F2PS1SLC" # (2nd follow up, composite) first attended postsecondary institution, selectivity
)
els_keepvars
## [1] "STU_ID" "STRAT_ID" "PSU" "BYRACE"
## [5] "BYINCOME" "BYPARED" "BYNELS2M" "BYNELS2R"
## [9] "F3ATTAINMENT" "F2PS1SEC" "F3ERN2011" "F1SEX"
## [13] "F2EVRATT" "F2PS1LVL" "F2PS1CTR" "F2PS1SLC"
load(url("https://github.com/anyone-can-cook/rclass2/raw/main/data/els/els.RData"))
els <- els %>%
# keep only subset of vars
select(all_of(els_keepvars)) %>%
# lower variable names
rename_with(tolower)
glimpse(els)
## Rows: 16,197
## Columns: 16
## $ stu_id <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 1011…
## $ strat_id <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 10…
## $ psu <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ byrace <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, …
## $ byincome <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, 1…
## $ bypared <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2, …
## $ bynels2m <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16,…
## $ bynels2r <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66,…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, -…
## $ f2ps1sec <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, 2…
## $ f3ern2011 <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 68…
## $ f1sex <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, …
## $ f2evratt <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1,…
## $ f2ps1lvl <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, 1…
## $ f2ps1ctr <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, 2…
## $ f2ps1slc <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, …
els %>% var_label()
## $stu_id
## [1] "Student ID"
##
## $strat_id
## [1] "Stratum"
##
## $psu
## [1] "Primary sampling unit"
##
## $byrace
## [1] "Student's race/ethnicity-composite"
##
## $byincome
## [1] "Total family income from all sources 2001-composite"
##
## $bypared
## [1] "Parents' highest level of education"
##
## $bynels2m
## [1] "ELS-NELS 1992 scale equated sophomore math score"
##
## $bynels2r
## [1] "ELS-NELS 1992 scale equated sophomore reading score"
##
## $f3attainment
## [1] "Highest level of education earned as of F3"
##
## $f2ps1sec
## [1] "Sector of first postsecondary institution"
##
## $f3ern2011
## [1] "2011 employment income: R only"
##
## $f1sex
## [1] "F1 sex-composite"
##
## $f2evratt
## [1] "Whether has ever attended a postsecondary institution - composite"
##
## $f2ps1lvl
## [1] "Level of offering of first postsecondary institution"
##
## $f2ps1ctr
## [1] "Control of first postsecondary institution"
##
## $f2ps1slc
## [1] "Institutional selectivity of first attended postsecondary institution"
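The dbl+lbl columns and the var_label() output above come from labelled data. The following is a minimal sketch on a toy vector (the object toy_sex and its values are invented for illustration and are not taken from ELS) showing how a variable label, value labels, and as_factor() relate to one another:

```r
library(haven)
library(labelled)

# Hypothetical toy vector that mimics a labelled survey variable
toy_sex <- labelled(
  c(1, 2, 2, -4),
  labels = c("Nonrespondent" = -4, "Male" = 1, "Female" = 2),
  label = "Toy sex variable"
)

var_label(toy_sex)   # the variable label (a single string)
val_labels(toy_sex)  # the value labels (a named numeric vector)

# as_factor() replaces each underlying value with its value label
toy_sex_fct <- as_factor(toy_sex)
toy_sex_fct
```

This is why the lecture calls as_factor() before printing: the underlying values are numeric codes, and the labels are what make them readable.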
Basic definitions:
The seven parameters of the layered grammar of graphics consist of: data, mappings, geom function, stat, position, coordinate function, and facet function.
ggplot2 – part of tidyverse – is an R package to create graphics and ggplot() is a function within the ggplot2 package.
“In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.” (Wickham & Grolemund, 2017, Chapter 3)
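To illustrate the quote above, the following sketch builds the same scatterplot of the mpg data twice: once supplying only the data, the mappings, and the geom function, and once writing out the remaining parameters. It assumes that stat = "identity", position = "identity", coord_cartesian(), and facet_null() are the values ggplot2 fills in by default for geom_point(); the object names p_min and p_full are made up for illustration.

```r
library(ggplot2)

# Minimal form: only data, mappings, and a geom function are supplied
p_min <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Same plot with the assumed defaults written out explicitly
p_full <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(stat = "identity", position = "identity") +
  coord_cartesian() +
  facet_null()

p_min
p_full
```

Both objects should render the same picture, which is the practical meaning of "useful defaults".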
Syntax conveying the seven parameters of the layered grammar of graphics:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>(
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
What does Wickham mean by layers? (from “Telling Stories with Data Using the Grammar of Graphics” by Liz Sander)
The five layers of the grammar of graphics:
Data defines the information to be visualized.
Example: Imagine a dataset where each observation is a student
Variables include high school math test score (bynels2m), earnings in 2011 (f3ern2011), and student sex (f1sex)
glimpse(els)
## Rows: 16,197
## Columns: 16
## $ stu_id <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 1011…
## $ strat_id <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 10…
## $ psu <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ byrace <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, …
## $ byincome <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, 1…
## $ bypared <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2, …
## $ bynels2m <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16,…
## $ bynels2r <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66,…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, -…
## $ f2ps1sec <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, 2…
## $ f3ern2011 <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 68…
## $ f1sex <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, …
## $ f2evratt <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1,…
## $ f2ps1lvl <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, 1…
## $ f2ps1ctr <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, 2…
## $ f2ps1slc <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, …
els %>% select(stu_id,bynels2m,f3ern2011,f1sex) %>% as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id bynels2m f3ern2011     f1sex
##     <dbl> <fct>    <fct>         <fct>
##  1 101101 47.84    4000          Female
##  2 101102 55.3     3000          Female
##  3 101104 66.24    37000         Female
##  4 101105 35.33    1500          Female
##  5 101106 29.97    48000         Female
##  6 101107 24.28    35000         Male
##  7 101108 45.16    17000         Male
##  8 101109 66.01    68000         Male
##  9 101110 28.28    Nonrespondent Male
## 10 101111 38.85    42000         Male
Mapping defines how variables in a dataset are applied (mapped) to a graphic.
Example: Consider the previous dataset
els %>% select(stu_id,bynels2m,f3ern2011,f1sex) %>%
rename(x=bynels2m, y=f3ern2011, color=f1sex) %>%
as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id x     y             color
##     <dbl> <fct> <fct>         <fct>
##  1 101101 47.84 4000          Female
##  2 101102 55.3  3000          Female
##  3 101104 66.24 37000         Female
##  4 101105 35.33 1500          Female
##  5 101106 29.97 48000         Female
##  6 101107 24.28 35000         Male
##  7 101108 45.16 17000         Male
##  8 101109 66.01 68000         Male
##  9 101110 28.28 Nonrespondent Male
## 10 101111 38.85 42000         Male
A statistical transformation transforms the underlying data before plotting it.
Example: Imagine creating a scatterplot of the relationship between HS math test score (x-axis) and 2011 income (y-axis)
els %>% select(stu_id,bynels2m,f3ern2011) %>% rename(x=bynels2m, y=f3ern2011) %>%
as_factor() %>% head(10)
## # A tibble: 10 x 3
##    stu_id x     y
##     <dbl> <fct> <fct>
##  1 101101 47.84 4000
##  2 101102 55.3  3000
##  3 101104 66.24 37000
##  4 101105 35.33 1500
##  5 101106 29.97 48000
##  6 101107 24.28 35000
##  7 101108 45.16 17000
##  8 101109 66.01 68000
##  9 101110 28.28 Nonrespondent
## 10 101111 38.85 42000
Example: Imagine creating a bar chart of the number of students by race/ethnicity
els %>% count(byrace) %>% as_factor()
## # A tibble: 9 x 2
##   byrace                                       n
##   <fct>                                    <int>
## 1 Survey component legitimate skip/NA        305
## 2 Nonrespondent                              648
## 3 Amer. Indian/Alaska Native, non-Hispanic   130
## 4 Asian, Hawaii/Pac. Islander,non-Hispanic  1460
## 5 Black or African American, non-Hispanic   2020
## 6 Hispanic, no race specified                996
## 7 Hispanic, race specified                  1221
## 8 More than one race, non-Hispanic           735
## 9 White, non-Hispanic                       8682
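Counts like those above are what a bar chart displays after its statistical transformation. As a rough sketch of the idea, the following uses the mpg data (rather than els, so that it runs without downloading anything) and layer_data() to inspect the values that the default stat of geom_bar() computes before drawing; the object names p_bar, bar_data, and class_count are illustrative.

```r
library(ggplot2)
library(dplyr)

# Bar chart of number of cars by class; geom_bar() uses stat = "count" by default
p_bar <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# Data drawn by the first layer, after the statistical transformation
bar_data <- layer_data(p_bar, 1)
bar_data$count

# The same counts computed by hand
class_count <- mpg %>% count(class)
class_count
```

The count column produced by the stat should match the n column from count(), which is the sense in which the data are transformed before plotting.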
Graphs visually display data, using geometric objects like a point, line, bar, etc.
Position adjustment adjusts the position of visual elements in the plot so that these visual elements do not overlap with one another in ways that make the plot difficult to interpret.
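As a sketch of why a position adjustment can matter, the following counts how many distinct combinations of cyl and hwy occur among the rows of the mpg dataset that ships with ggplot2; rows sharing a combination are drawn on top of one another in a plain scatterplot. It then applies jitter through position_jitter(), whose width, height, and seed arguments control the amount of noise and make it reproducible. The values 0.2, 0.5, and 123 are arbitrary choices, not recommendations.

```r
library(ggplot2)
library(dplyr)

n_obs <- nrow(mpg)
n_positions <- mpg %>% distinct(cyl, hwy) %>% nrow()
n_obs
n_positions  # fewer distinct positions than observations means overplotting

# Jitter with explicit amounts and a seed so the plot is the same on every run
p_jitter <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.5, seed = 123))
p_jitter
```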
Example: The dataset mpg (included in the ggplot2 package) contains variables for the specifications of different cars, with 234 observations
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point()
The jitter position adjustment “adds a small amount of random variation to the location of each point” (from ?geom_jitter):
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(position = "jitter")
“A coordinate system maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn. Plots typically use two coordinates (x,y), but could use any number of coordinates.” (Grammar of Graphics)
Example: Cartesian coordinate system
x1 <- c(1, 10)
y1 <- c(1, 5)
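# qplot() is deprecated in recent versions of ggplot2, so build the blank plot with ggplot()
p <- ggplot(data = data.frame(x1, y1), mapping = aes(x = x1, y = y1)) +
  geom_blank() +
  labs(x = NULL, y = NULL) +
  theme_bw()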
p +
  ggtitle(label = "Cartesian coordinate system")
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
Example: Polar coordinate system
p +
  coord_polar() +
  ggtitle(label = "Polar coordinate system")
Facets are subplots that each display one subset of the data. They are most commonly used to create “small multiples”
Example: Imagine creating a scatterplot of the relationship between number of cylinders in the engine (x-axis) and highway miles-per-gallon (y-axis), with separate subplots for car class (e.g., midsize, minivan, pickup, suv)
ggplot(data = mpg) +
geom_point(mapping = aes(x = cyl, y = hwy), position = "jitter") +
facet_wrap(~ class, nrow = 2)
ggplot
ggplot() and aes() functions
Show help pages for package ggplot2:
help(package = ggplot2)
The ggplot() function:
?ggplot
# SYNTAX AND DEFAULT VALUES
ggplot(data = NULL, mapping = aes())
“ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden”
data: Dataset to use for plot. If not specified in ggplot() function, must be supplied in each layer added to the plot.
mapping: Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot.
The aes() function (often called within the ggplot() function):
?aes
# SYNTAX
aes(x, y, ...)
“Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. Aesthetic mappings can be set in ggplot() and in individual layers.”
x, y, ...: List of name value pairs giving aesthetics to map to variables
The names for x and y aesthetics are typically omitted because they are so common
Example: Putting ggplot() and aes() together
Specifying ggplot() and aes() without specifying a geom layer (e.g., geom_point()) creates a blank ggplot:
ggplot(data = diamonds, aes(x = carat, y = price))
Equivalently, naming the mapping argument explicitly:
ggplot(data = diamonds, mapping = aes(x = carat, y = price))
Because diamonds is a data frame, it can be piped into the data argument of ggplot():
class(diamonds)
## [1] "tbl_df" "tbl" "data.frame"
diamonds %>% ggplot(mapping = aes(x = carat, y = price))
The result of ggplot() can also be assigned to an object:
diam_ggplot <- ggplot(data = diamonds, aes(x = carat, y = price))
diam_ggplot # blank ggplot
typeof(diam_ggplot)
## [1] "list"
class(diam_ggplot)
## [1] "gg" "ggplot"
str(diam_ggplot)
## List of 9
## $ data : tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## ..$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## ..$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## ..$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## ..$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## ..$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## ..$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## ..$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## ..$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## ..$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## ..$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ layers : list()
## $ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
## add: function
## clone: function
## find: function
## get_scales: function
## has_scale: function
## input: function
## n: function
## non_position_scales: function
## scales: NULL
## super: <ggproto object: Class ScalesList, gg>
## $ mapping :List of 2
## ..$ x: language ~carat
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..$ y: language ~price
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..- attr(*, "class")= chr "uneval"
## $ theme : list()
## $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: TRUE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordCartesian, Coord, gg>
## $ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## $ plot_env :<environment: R_GlobalEnv>
## $ labels :List of 2
## ..$ x: chr "carat"
## ..$ y: chr "price"
## - attr(*, "class")= chr [1:2] "gg" "ggplot"
attributes(diam_ggplot)
## $names
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
##
## $class
## [1] "gg" "ggplot"
diam_ggplot$mapping
## Aesthetic mapping:
## * `x` -> `carat`
## * `y` -> `price`
diam_ggplot$labels
## $x
## [1] "carat"
##
## $y
## [1] "price"
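As the output above suggests, the components of a ggplot object can be checked in code. The sketch below (the names blank_plot and point_plot are illustrative) shows that the layers component is empty for a blank plot and gains one element once a geom is added.

```r
library(ggplot2)

blank_plot <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))
length(blank_plot$layers)    # no layers yet, so nothing is drawn

point_plot <- blank_plot + geom_point()
length(point_plot$layers)    # one layer after adding geom_point()
```

The + operator returns a new object, so blank_plot itself is left unchanged and can be reused with other geoms.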
Adding a geometric layer to a ggplot object dictates how observations are displayed in the plot.
geom_point(): creates a scatterplot
geom_bar(): creates a bar chart
geom_point()
Scatterplots are most useful for showing the relationship between two continuous variables.
Example: Scatterplot of the relationship between carat and price, using the diamonds dataset
#ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_point()
diam_ggplot + geom_point() # same plot, starting from the stored ggplot object
Example: Scatterplot of the relationship between high school math test score (bynels2m) and 2011 earnings (f3ern2011), using the els dataset
els %>% select(bynels2m,f3ern2011) %>%
  summarize_all(.funs = list(~ mean(., na.rm = TRUE), ~ min(., na.rm = TRUE), ~ max(., na.rm = TRUE)))
## # A tibble: 1 x 6
##   bynels2m_mean f3ern2011_mean bynels2m_min f3ern2011_min bynels2m_max
##           <dbl>          <dbl>        <dbl>         <dbl>        <dbl>
## 1          44.3         21276.           -8            -8         79.3
## # … with 1 more variable: f3ern2011_max <dbl>
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m)
## # A tibble: 1 x 2
##   bynels2m                                      n
##   <dbl+lbl>                                 <int>
## 1 -8 [Survey component legitimate skip/NA]    305
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m) %>% as_factor()
## # A tibble: 1 x 2
##   bynels2m                                n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   305
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011)
## # A tibble: 2 x 2
##   f3ern2011                                     n
##   <dbl+lbl>                                 <int>
## 1 -8 [Survey component legitimate skip/NA]    459
## 2 -4 [Nonrespondent]                         2488
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011) %>% as_factor()
## # A tibble: 2 x 2
##   f3ern2011                               n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   459
## 2 Nonrespondent                        2488
The negative values are missing-data codes rather than real scores or earnings, so create new variables that replace them with NA:
els_v2 <- els %>%
  mutate(
    hs_math = if_else(bynels2m<0,NA_real_,as.numeric(bynels2m)),
    earn2011 = if_else(f3ern2011<0,NA_real_,as.numeric(f3ern2011))
  )
#check
els_v2 %>% filter(bynels2m<0) %>% count(bynels2m, hs_math)
## # A tibble: 1 x 3
##   bynels2m                                  hs_math     n
##   <dbl+lbl>                                   <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA]       NA   305
els_v2 %>% filter(f3ern2011<0) %>% count(f3ern2011, earn2011)
## # A tibble: 2 x 3
##   f3ern2011                                 earn2011     n
##   <dbl+lbl>                                    <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA]        NA   459
## 2 -4 [Nonrespondent]                              NA  2488
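The recoding pattern above can be tried on a small example. The following sketch uses a made-up tibble (toy, with invented values in which -8 and -4 stand in for the missing-data codes) to show how if_else() replaces negative codes with NA and how this changes a mean:

```r
library(dplyr)

# Hypothetical data: two real scores and two negative missing-data codes
toy <- tibble(score = c(47.84, -8, 55.30, -4))

toy <- toy %>%
  mutate(score_clean = if_else(score < 0, NA_real_, score))

# The codes pull the naive mean down; the cleaned mean ignores them
mean(toy$score)
mean(toy$score_clean, na.rm = TRUE)
```

Leaving the codes in place would also put points at -8 and -4 on a scatterplot, which is why the recoding is done before plotting.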
els_v2 %>% count(bypared) %>% as_factor()
## # A tibble: 11 x 2
##    bypared                                      n
##    <fct>                                    <int>
##  1 Missing                                     49
##  2 Survey component legitimate skip/NA        179
##  3 Nonrespondent                              648
##  4 Did not finish high school                 944
##  5 Graduated from high school or GED         3053
##  6 Attended 2-year school, no degree         1666
##  7 Graduated from 2-year school              1597
##  8 Attended college, no 4-year degree        1758
##  9 Graduated from college                    3468
## 10 Completed Master's degree or equivalent   1786
## 11 Completed PhD, MD, other advanced degree  1049
# keep students whose parents completed a PhD, MD, or other advanced degree (bypared==8)
els_parphd <- els_v2 %>% filter(bypared==8)
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011)) + geom_point()
The geom_point() function:
?geom_point
# SYNTAX AND DEFAULT VALUES
geom_point(mapping = NULL, data = NULL, stat = "identity",
position = "identity", ..., na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
geom_point() understands (i.e., accepts) the following aesthetics (the required aesthetics are x and y):
x, y, alpha, colour, fill, group, shape, size, stroke
Other geoms (e.g., geom_bar()) accept a different set of aesthetics
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), using the mpg dataset
Use the color aesthetic so that the color of each point is determined by type of car (class):
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
Alternatively, the color aesthetic can be specified within geom_point():
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class))
Student Task: Using the els_parphd dataset, create a scatterplot of the relationship between HS math score (hs_math) on the x-axis and 2011 earnings (earn2011) on the y-axis, with the color of points determined by sex (f1sex)
Mapping f1sex to color directly does not give the intended result, because f1sex is a labelled numeric variable and a discrete color scale needs a factor variable:
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = f1sex)) + geom_point()
Converting f1sex with as_factor() gives one color per category:
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) + geom_point()
geom_smooth()
Why use geom_smooth()?
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_point()
geom_smooth() creates smoothed prediction lines with shaded confidence intervals:
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_smooth()
The geom_smooth() function:
?geom_smooth
# SYNTAX AND DEFAULT VALUES
geom_smooth(mapping = NULL, data = NULL, stat = "smooth",
position = "identity", ..., method = "auto", formula = y ~ x,
se = TRUE, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
Note the default statistical transformation (stat), as compared to that of geom_point():
stat = "smooth" for geom_smooth()
stat = "identity" for geom_point()
geom_smooth() accepts the following aesthetics (the required aesthetics are x and y):
x, y, alpha, colour, fill, group, linetype, size, weight, ymax, ymin
Example: Smoothed prediction lines for high school math test score (hs_math, derived from bynels2m) versus 2011 earnings (earn2011, derived from f3ern2011), using the els_v2 dataset
The aesthetics can be specified within geom_smooth() rather than within ggplot():
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011))
Use the group aesthetic to create separate prediction lines by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, group=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, group=as_factor(f1sex)))
Use the linetype aesthetic to create separate prediction lines (with different line styles) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex)))
Use the color aesthetic to create separate prediction lines (with different colors) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, color=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, color=as_factor(f1sex)))
Example: Layer smoothed prediction lines (geom_smooth()) on top of scatterplot (geom_point())
ggplot(data= els_v2) +
geom_point(mapping = aes(x = hs_math, y = earn2011)) +
geom_smooth(mapping = aes(x = hs_math, y = earn2011))
Equivalently, specify the shared aesthetics once within ggplot():
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) +
  geom_point() +
  geom_smooth()
Adjust the axis limits using + xlim() and + ylim():
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) +
  geom_point() +
  geom_smooth() +
  xlim(c(20,80)) + ylim(c(0,100000))
Layer smoothed prediction lines with different line types by sex (f1sex) on top of scatterplot with different point colors by sex:
ggplot(data= els_v2) +
  geom_point(mapping = aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) +
  geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype = as_factor(f1sex))) +
  xlim(c(20,80)) + ylim(c(0,100000))
geom_bar() and geom_col()
Bar charts are used to plot a single, discrete variable.
Two geom functions to create bar charts:
geom_bar(): The height of each bar represents the number of cases (i.e., observations) in the group
Use geom_bar() when you have (for example) student-level data and do not want to summarize it prior to creating the chart
geom_col(): The height of each bar represents the value of some variable for the group
Use geom_col() when you have already created an object of summary statistics (e.g., counts, mean value, etc.)
The geom_bar() and geom_col() functions:
?geom_bar
# SYNTAX AND DEFAULT VALUES
geom_bar(mapping = NULL, data = NULL, stat = "count",
position = "stack", ..., width = NULL, binwidth = NULL,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
?geom_col
# SYNTAX AND DEFAULT VALUES
geom_col(mapping = NULL, data = NULL, position = "stack", ...,
width = NULL, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
Example: Bar chart with the variable cut (e.g., “Fair,” “Good,” “Ideal”) as x-axis and number of diamonds as y-axis, using the diamonds dataset
diamonds %>% count(cut)
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551
Method 1: Create bar chart using geom_bar()
ggplot(data = diamonds, aes(x = cut)) +
geom_bar()
Method 2: Create bar chart using geom_col()
First, create an object containing the number of observations for each value of cut:
cut_count <- diamonds %>% count(cut)
cut_count
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551
Then use ggplot() + geom_col() to plot the data from the object cut_count:
ggplot(data = cut_count, aes(x = cut, y=n)) +
  geom_col()
Alternatively, use pipes to create the bar chart without creating the cut_count object first:
#diamonds %>% count(cut) %>% str()
diamonds %>% count(cut) %>% str()
## tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## $ cut: Ord.factor w/ 5 levels "Fair"<"Good"<..: 1 2 3 4 5
## $ n : int [1:5] 1610 4906 12082 13791 21551
diamonds %>% count(cut) %>% ggplot(aes(x= cut, y=n)) +
geom_col()
Student Task: Using the els_v2 dataset, create a bar chart with the variable “ever attended postsecondary education” (f2evratt) as x-axis and number of students as y-axis
els_v2 %>% count(f2evratt) %>% as_factor()
## # A tibble: 5 x 2
##   f2evratt                                n
##   <fct>                               <int>
## 1 Survey component legitimate skip/NA   359
## 2 Nonrespondent                        1691
## 3 Item legitimate skip/NA               108
## 4 No                                   3505
## 5 Yes                                 10534
Method 1: Create bar chart using geom_bar()
ggplot(data = els_v2, aes(x = as_factor(f2evratt))) +
geom_bar()
Remove missing values of f2evratt before plotting:
els_v2 %>% filter(f2evratt>=0) %>% ggplot(aes(x = as_factor(f2evratt))) +
  geom_bar()
Method 2: Create bar chart using geom_col()
els_v2 %>%
# filter to remove missing values
filter(f2evratt>=0) %>%
# use count() to create summary statistics object
count(f2evratt) %>%
# plot summary statistic object
ggplot(aes(x=as_factor(f2evratt), y=n)) + geom_col()
Facets divide a plot into subplots based on the values of one or more discrete variables. They are most commonly used to create “small multiples”
Two functions to split your plots into facets:
facet_grid(): Display subplots in grid format, where rows and columns are determined by the faceting variable(s)
facet_grid() is most useful when you have two discrete variables, and all combinations of the variables exist in the data
facet_wrap(): Display all subplots side-by-side, but can be wrapped to fill multiple rows
facet_wrap() generally makes better use of screen space, and you can specify the number of rows or columns
The facet_grid() and facet_wrap() functions:
?facet_grid
# SYNTAX AND DEFAULT VALUES
facet_grid(rows = NULL, cols = NULL, scales = "fixed",
space = "fixed", shrink = TRUE, labeller = "label_value",
as.table = TRUE, switch = NULL, drop = TRUE, margins = FALSE,
facets = NULL)
?facet_wrap
# SYNTAX AND DEFAULT VALUES
facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
shrink = TRUE, labeller = "label_value", as.table = TRUE,
switch = NULL, drop = TRUE, dir = "h", strip.position = "top")
Specifying which variable(s) to facet your plot on:
facet_grid()
Since facet_grid() arranges subplots in a grid format, we need to specify how we define the rows and columns
One way is to use the rows and cols arguments, which should be variables quoted by vars()
facet_grid(rows = vars(<var_1>), cols = vars(<var_2>)): facet into both rows and columns
facet_grid(rows = vars(<var_1>)): facet into rows only
facet_grid(cols = vars(<var_1>)): facet into columns only
Another way is to use a formula of the form <row_var> ~ <col_var>
facet_grid(<var_1> ~ <var_2>): facet into both rows and columns
facet_grid(<var_1> ~ .): facet into rows only
facet_grid(. ~ <var_1>): facet into columns only
facet_wrap()
facet_wrap() also accepts a formula for its facets argument
facet_wrap(~ <var_1>): facet by one variable
facet_wrap(<var_1> ~ <var_2>): facet on the combination of two variables
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl), from the mpg dataset
Method 1: Faceting using facet_grid()
# Facet into rows
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(rows = vars(cyl))
# Facet into columns
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cols = vars(cyl))
# Facet into rows (formula syntax)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cyl ~ .)
# Facet into columns (formula syntax)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cyl)
Method 2: Faceting using facet_wrap()
Unlike facet_grid(), facet_wrap() is not restricted to either rows or columns:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl)
# Use nrow to place all subplots in a single row
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1)
# Use ncol to place all subplots in a single column
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 1)
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl) and type of car (class), from the mpg dataset
Method 1: Faceting using facet_grid()
Facet the rows based on cyl and the columns based on class:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(cyl), cols = vars(class))
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(cyl ~ class)
Method 2: Faceting using facet_wrap()
Since facet_wrap() is not defined by rows and columns, it omits any subplots that do not display any data:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class, nrow = 3)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class, ncol = 4)
The plots generated by ggplot can be exported as a PDF, PNG, or other file types. (From Creating and Saving Graphs - R Base Graphs)
In RStudio, the generated plots will typically be displayed in the lower right panel. There is an Export button that allows you to save the plot as a PDF or PNG:
There are also various R functions, including jpeg(), png(), svg(), and pdf(), for exporting plots.
The steps for saving a plot:
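Open the file using a function such as pdf() or jpeg(), with the optional arguments height and width for specifying image dimension
Create the plot
Close the file using dev.off()
Example: Exporting plot using pdf()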
# Open the file
pdf('Rplot.pdf')
# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Close the file
dev.off()
Example: Exporting plot using jpeg()
# Open the file
jpeg('Rplot.jpg', width = 350, height = 350)
# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Close the file
dev.off()
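As an alternative to opening and closing a graphics device by hand, ggplot2 provides ggsave(), which infers the file type from the file extension. The following is a minimal sketch, not part of the original examples: the object names p and out_file, the use of tempdir() as the output location, and the 6 x 4 inch size are all choices made for illustration.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Hypothetical output location; tempdir() is used so no project files are touched
out_file <- file.path(tempdir(), "Rplot.pdf")

# ggsave() opens the device, draws the plot, and closes the device
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```

One related point about the pdf() and jpeg() examples above: a ggplot is drawn automatically only when it is evaluated at the top level, so inside a function or a loop the plot needs to be wrapped in print() before dev.off() is called.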